Consider data on GDP and Life Expectancy of all countries between the years 1960 and 2018 (Bank 2018). Specifically, the data consists of \(n\) observations \((Y_1, \vec{X_1}),...,(Y_n,\vec{X_n})\), where \(Y\) represents Life Expectancy and the vector \(\vec{X}\) represents GDP and Year.
On a first-pass visualization of this data, we find a mostly linear relationship between Life Expectancy & Year (1), and a mostly logarithmic relationship between Life Expectancy & GDP (2).
We would like to cluster the countries in the dataset into subgroups, each with a distinctive relationship among the covariates. We further expect the relationship between Life Expectancy and both Year and GDP to be monotone non-decreasing, and we want to allow for distinct non-linearities in one or both of these relationships within each cluster of countries.
One model that can accomplish this clustering task while incorporating monotonicity shape constraints within the submodels is a Mixture of Regressions, wherein each Regression is a Partial Linear Model. In the rest of this document, we elaborate the theory behind this model and demonstrate how such a model can be fit by an extension of the flexmix package in R. For a broader introduction to the use of this flexmix extension, see its GitHub page.
We begin with the component model, the partial linear model. The generalized partial linear model is an additive regression model with some finite combination of linear and non-linear components, which can be denoted thus:
\[\begin{equation} \tag{1} Y = \sum_{h=1}^{p} g_{h} (X_{h}) \ +\ \sum_{j=1}^{q} \beta_{j} X_{j} \ +\ \epsilon \end{equation}\]where the model has \(p\) non-linear components and \(q\) linear components, and each \(g_{h}()\) is some nonparametric function of \(X_{h}\).
When the non-linear components of the partial linear model have monotone shape constraints, we choose to estimate each \(g_{i}()\) using the Pool Adjacent Violators Algorithm (PAVA) or Cyclic Pool Adjacent Violators Algorithm (CPAV). The PAVA – for univariate monotone regression (2) – returns a step-function fit in a single iteration without having to select a bandwidth, while the CPAV – for multivariable monotone regression (3) – returns a sum of step-functions by iterating through and updating univariate monotone functions until convergence.
\[\begin{equation} \tag{2} Y = g_{h} (X_{h}) \ +\ \epsilon \end{equation}\] \[\begin{equation} \tag{3} Y = \sum_{h=1}^{p} g_{h} (X_{h}) \ +\ \epsilon \end{equation}\]The MLE of the entire partial linear model is obtained via a backfitting algorithm suggested by Cheng (Cheng 2008), sequentially updating the non-linear and linear components of the partial linear model in a two-step process (I, II) until convergence.
\[\begin{equation} (I) \ \ \ \ \ \ \ \ \ \{g_1,...,g_p\} = \underset{g_1,...,g_p}{\operatorname{argmin}} \sum_{i=1}^n\left(y_i-\sum_{j=1}^{q} \beta_{j} x_{ij}-\sum_{h=1}^{p} g_{h} (x_{ih})\right)^2 \\ \text{holding } \vec\beta \text{ fixed} \end{equation}\] \[\begin{equation} (II) \ \ \ \ \ \ \ \ \ \vec\beta = \underset{\vec\beta}{\operatorname{argmin}} \sum_{i=1}^n\left(y_i-\sum_{h=1}^{p} g_{h} (x_{ih}) - \sum_{j=1}^{q} \beta_{j} x_{ij}\right)^2 \\ \text{holding } \{g_1,...,g_p\} \text{ fixed} \end{equation}\]To demonstrate this estimator, we fit a Partial Linear Model to randomly generated pseudo-data in which one covariate has a non-linear monotone effect on the dependent variable. We verify in the subsequent plot that the model indeed recovers the correct monotone shape (cubic) and the correct linear coefficients (\(1.5, -1.5, -1, 1\)).
# data with 1 monotone covariate
################
X <- cbind(
  runif(1000, -5, 5),
  runif(1000, -10, 10),
  runif(1000, -100, 100),
  runif(1000, -100, 100),
  runif(1000, -100, 100)
)
Y1 <- (X[,1])^3 + 1.5*X[,2] - 1.5*X[,3] - 1*X[,4] + X[,5] + rnorm(1000, 0, 10)
df_1 <- data.frame(Y1, X)
names(df_1) <- c("Y", "X1", "X2", "X3", "X4", "X5")
################
# build model
m1 <- part_fit(x = df_1[,2:6], y = df_1$Y, mon_inc_index = 1)
# plot fitted model
plot(m1)
# print coefficient estimates
m1$coef
## X2 X3 X4 X5
## 1.551839 -1.506049 -1.003185 1.004205
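To make the two-step backfitting concrete, it can be sketched in base R using stats::isoreg (an implementation of PAVA) for the isotonic step and lm for the linear step. This is a minimal illustration on toy data, not the part_fit implementation itself:

```r
# Backfitting sketch: one monotone non-decreasing covariate (x1),
# one linear covariate (x2), alternating PAVA and least squares.
set.seed(1)
n  <- 500
x1 <- runif(n, -2, 2)                  # covariate with a monotone effect
x2 <- runif(n, -1, 1)                  # covariate with a linear effect
y  <- x1^3 + 2 * x2 + rnorm(n, 0, 0.5)

g_hat    <- rep(0, n)                  # current estimate of g(x1)
beta_hat <- 0
for (iter in 1:50) {
  # (II) linear step: regress the partial residual on x2, holding g fixed
  beta_new <- coef(lm((y - g_hat) ~ x2 - 1))[1]
  # (I) isotonic step: PAVA on the partial residual, holding beta fixed
  iso   <- isoreg(x1, y - beta_new * x2)
  g_hat <- iso$yf[order(order(x1))]    # map fitted values back to input order
  converged <- abs(beta_new - beta_hat) < 1e-8
  beta_hat  <- beta_new
  if (converged) break
}
round(beta_hat, 2)                     # should be close to the true value 2
```

The isoreg fit is always non-decreasing; a non-increasing component can be obtained by negating the covariate before fitting.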
We can optionally fit more complex partial linear models by, for example, adding monotone covariates. Below, we fit and plot a partial linear model with 1 monotone non-decreasing covariate and 2 monotone non-increasing covariates.
# data with 3 monotone covariates
################
X <- cbind(
  runif(100, -5, 5),
  runif(100, -10, 10),
  runif(100, -100, 100),
  runif(100, -100, 100),
  runif(100, -100, 100)
)
W <- runif(100, 0, 1)
W[c(10, 20, 30, 40, 50, 60, 70)] <- 0  # give some observations zero weight
Y <- (X[,1])^3 - X[,2]^3 - 1.5*X[,3] - 2*X[,4] + X[,5] + rnorm(100, 0, 1000)
df_2 <- data.frame(Y, X, W)
names(df_2) <- c("Y", "X1", "X2", "X3", "X4", "X5", "W")
################
# build model
m2 <- part_fit(x = df_2[,2:6], y = df_2$Y, mon_inc_index = 1, mon_dec_index = c(2,3), wates = df_2$W)
# plot fitted model
plot(m2)
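The CPAV iteration underlying a model like this one can likewise be sketched with stats::isoreg, cycling over the monotone components until the additive fit stabilizes. Below is a minimal sketch on toy data, with the increasing fit reused (via a sign flip on the covariate) for the non-increasing component:

```r
# CPAV sketch for an additive isotonic model y = g1(x1) + g2(x2) + e,
# with g1 non-decreasing and g2 non-increasing.
set.seed(2)
n  <- 400
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
y  <- x1^3 - 2 * x2 + rnorm(n, 0, 0.5)

iso_fit <- function(x, r, decreasing = FALSE) {
  # fit a monotone step function to the partial residual r
  if (decreasing) x <- -x              # flip x to reuse the increasing fit
  isoreg(x, r)$yf[order(order(x))]     # fitted values in input order
}

g1 <- g2 <- rep(0, n)
for (iter in 1:100) {
  g1_new <- iso_fit(x1, y - g2)
  g1_new <- g1_new - mean(g1_new)      # centre one component for identifiability
  g2_new <- iso_fit(x2, y - g1_new, decreasing = TRUE)
  delta  <- max(abs(g1_new - g1), abs(g2_new - g2))
  g1 <- g1_new
  g2 <- g2_new
  if (delta < 1e-8) break
}
```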
Follow these links for a more in-depth review of Finite Mixture Models and Mixture of Regression Models.
As in the motivating example, suppose we observe \(Y_1,...,Y_n\) and associated \(\vec{X}_1,...,\vec{X}_n\). We assume that each observed pair \((Y_i, \vec{X_i})\) belongs to one of \(\{1,...,k\}\) unobserved components, for some positive integer \(k\), and we denote this membership by a vector \(Z\) where \(Z_i \in \{1,...,k\}\).
Without specifying the exact form of the regression model \(Y \sim f_k(\cdot|\vec{X})\) for component \(k\), we can assume some vector of regression model parameters \(\Theta\) and write the likelihood of the mixture model as:
\[L(\pi,\ \Theta) \ =\ \prod_{i=1}^n \sum_{j=1}^k \pi_j P_j(Y_i\ |\ \vec{X_i},\ \Theta_j)\] We find a local maximum of the likelihood function (Wu 1983) via the EM algorithm, which iteratively estimates the component memberships \(Z\), the mixing weights \(\pi\), and the component parameters \(\Theta\).
By extension, the EM algorithm also allows us to compute the marginal probability of each \(Y_i\):
\[\begin{equation} \tag{4} P(Y_i = y) \ =\ \sum_{j=1}^{k}\pi_j P_j(Y_i = y\ |\ \vec{X_i} = \vec{x},\ \Theta_j) \end{equation}\]
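To make the EM iteration concrete, here is a base-R sketch for a \(k = 2\) mixture of simple linear regressions on toy data; flexmix automates this same alternation for general component models:

```r
# EM sketch for a two-component mixture of linear regressions.
set.seed(3)
n <- 300
x <- runif(n, -2, 2)
z <- rep(1:2, each = n / 2)               # true (unobserved) labels
y <- ifelse(z == 1, 2 + 3 * x, -2 - 3 * x) + rnorm(n, 0, 0.5)

X     <- cbind(1, x)                      # design matrix
pi_k  <- c(0.5, 0.5)                      # mixing weights
beta  <- list(c(0, 1), c(0, -1))          # (intercept, slope) per component
sigma <- c(1, 1)
for (iter in 1:100) {
  # E-step: posterior membership probabilities
  dens <- sapply(1:2, function(k)
    pi_k[k] * dnorm(y, drop(X %*% beta[[k]]), sigma[k]))
  post <- dens / rowSums(dens)
  # M-step: weighted least squares per component
  for (k in 1:2) {
    w         <- post[, k]
    fit       <- lm.wfit(X, y, w)
    beta[[k]] <- fit$coefficients
    sigma[k]  <- sqrt(sum(w * fit$residuals^2) / sum(w))
    pi_k[k]   <- mean(w)
  }
}
```

With these well-separated components, the recovered slopes land near the true values of 3 and -3, and the mixing weights near 0.5 each.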
Here we demonstrate the results of the estimator described above, combining the EM algorithm, the backfitting algorithm of Cheng (Cheng 2008), and the CPAV algorithm of Bacchetti (Bacchetti 1989). We begin by modelling randomly generated pseudo-data from 4 latent categories.
# data with 4 latent categories
################
X <- cbind(
  runif(1000, -5, 5),
  runif(1000, -10, 10),
  runif(1000, -100, 100),
  runif(1000, -100, 100),
  runif(1000, -100, 100)
)
Y1 <- (X[1:250,1])+3 + 1.5*X[1:250,2] - 1.5*X[1:250,3] - 1*X[1:250,4] + X[1:250,5] + rnorm(250, 0, 3) # component 1
Y2 <- (X[251:500,1])^3 + 3*X[251:500,2] + 2*X[251:500,3] - 2*X[251:500,4] + 2*X[251:500,5] + rnorm(250, 0, 4) # component 2
Y3 <- 2*((X[501:750,1])+5) - 2*X[501:750,2] - 1*X[501:750,3] + 2*X[501:750,4] + 4*X[501:750,5] + rnorm(250, 0, 3) # component 3
Y4 <- 2*((X[751:1000,1])-5) - 3*X[751:1000,2] - 3*X[751:1000,3] - 3*X[751:1000,4] + 3*X[751:1000,5] + rnorm(250, 0, 4) # component 4
df_3 <- data.frame(c(Y1, Y2, Y3, Y4), X)
names(df_3) <- c("Y", "X1", "X2", "X3", "X4", "X5")
################
# build model
m3 <- flexmix(Y ~ ., data = df_3, k = 4, model = mono_reg(mon_inc_names = "X1"), control = list(minprior = 0.1))
# plot fitted model
plot(m3, ylim=c(-100,100), palet="Dark2", root_scale="sqrt")
Next, we return to the motivating example, building four step models: with and without an intercept, and alternately with Year or GDP as the monotone covariate. Each step model is in fact a series of 15 mixture models, 3 for each of \(k = 1,...,5\). For each series, we plot the AIC and BIC per \(k\), the rootogram of the model with the lowest AIC, and the fitted monotone functions for the model with the lowest AIC.
# Life expectancy models
m4 <- stepFlexmix(LifeExpectancy ~ . - 1 - Country.Name | Country.Name, data = le,
                  k = 1:5, model = mono_reg(mon_inc_names = "Year")) # no intercept | grouped | monotone Year
m5 <- stepFlexmix(LifeExpectancy ~ . - Country.Name | Country.Name, data = le,
                  k = 1:5, model = mono_reg(mon_inc_names = "Year")) # with intercept | grouped | monotone Year
m6 <- stepFlexmix(LifeExpectancy ~ . - 1 - Country.Name | Country.Name, data = le,
                  k = 1:5, model = mono_reg(mon_inc_names = "GDP"))  # no intercept | grouped | monotone GDP
m7 <- stepFlexmix(LifeExpectancy ~ . - Country.Name | Country.Name, data = le,
                  k = 1:5, model = mono_reg(mon_inc_names = "GDP"))  # with intercept | grouped | monotone GDP
Finally, we plot the distribution of clusters within the lowest-AIC model of each step-model series, projected onto a world map. In these world map plots, the colors of each cluster span a spectrum from white to full-color, representing the strength of the posterior and the confidence of the model in placing a given country within a given cluster. The world-maps indicate what the rootograms had previously indicated, namely that the resulting models are extremely confident about the clustering of nearly all countries.
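The colour intensity in these maps is driven by the posterior matrix, which flexmix exposes via posterior(). The computation can be illustrated with a small hypothetical posterior matrix (rows are countries, columns are clusters) standing in for the fitted model's posteriors:

```r
# Toy posterior matrix standing in for posterior(m4), etc.
post <- rbind(
  c(0.97, 0.02, 0.01),   # confidently placed in cluster 1
  c(0.10, 0.85, 0.05),   # confidently placed in cluster 2
  c(0.40, 0.35, 0.25)    # ambiguous country
)
cluster    <- max.col(post)                              # hard assignment per country
confidence <- post[cbind(seq_len(nrow(post)), cluster)]  # shade strength
cluster     # -> 1 2 1
confidence  # -> 0.97 0.85 0.40
```

A confidence near 1 renders in full colour; a value near \(1/k\) renders close to white.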
Bacchetti, Peter. 1989. “Additive Isotonic Models.” Journal of the American Statistical Association 84 (405): 289–94.
Bank, World. 2018. “World Bank Open Data.” https://data.worldbank.org/indicator/NY.GDP.PCAP.CD.
Cheng, Guang. 2008. “Semiparametric Additive Isotonic Regression.” Journal of Statistical Planning and Inference, no. 139: 1980–91.
Wu, C. F. Jeff. 1983. “On the Convergence Properties of the EM Algorithm.” Annals of Statistics 11 (1): 95–103.